Data Integrity Suite

The suite is composed of various checks, such as Under Annotated Property Segments, Under Annotated Meta Data Segments, and Property Label Correlation.
Each check may contain conditions (each of which results in pass, fail, warning, or error) as well as other outputs such as plots or tables.
Suites, checks, and conditions can all be modified. Custom suites are described in the deepchecks documentation.


Conditions Summary

Check | Condition | More Info
Conflicting Labels - Train Dataset | Ambiguous sample ratio is less or equal to 0% | Ratio of samples with conflicting labels: 0.21%
Special Characters - Train Dataset | Ratio of samples containing more than 20% special characters is below 5% | Found 1 samples with special char ratio above threshold
Text Duplicates - Test Dataset | Duplicate data ratio is less or equal to 5% | Found 1.36% duplicate data
Text Duplicates - Train Dataset | Duplicate data ratio is less or equal to 5% | Found 2.58% duplicate data
Special Characters - Test Dataset | Ratio of samples containing more than 20% special characters is below 5% | Found 0 samples with special char ratio above threshold
Conflicting Labels - Test Dataset | Ambiguous sample ratio is less or equal to 0% | Ratio of samples with conflicting labels: 0%
Unknown Tokens - Train Dataset | Ratio of unknown words is less than 0% | Ratio was 0%
Unknown Tokens - Test Dataset | Ratio of unknown words is less than 0% | Ratio was 0%
Frequent Substrings - Train Dataset | No more than 1 substrings with ratio above 0.05 | Found 0 substrings with ratio above threshold
Frequent Substrings - Test Dataset | No more than 1 substrings with ratio above 0.05 | Found 0 substrings with ratio above threshold

Check With Conditions Output

Conflicting Labels - Train Dataset

Find identical samples which have different labels.

Conditions Summary
Condition | More Info
Ambiguous sample ratio is less or equal to 0% | Ratio of samples with conflicting labels: 0.21%
Additional Outputs
Each row in the table shows an example of a data sample and its observed conflicting labels as found in the dataset.
Observed Labels | Sample IDs | Text
1, 0, 0, 0, 0 | 249, 571, 1729, 2243, 2370 | يسقط الانقلاب
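
The idea behind this check can be illustrated with a short, standard-library sketch. It is not the deepchecks implementation: it assumes exact string matching (the real check may normalize text before comparing) and assumes the ratio counts every sample that belongs to a conflicting group. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def conflicting_label_ratio(texts, labels):
    """Return (ratio, conflicts) for identical texts that carry different labels.

    Assumption: every sample in a conflicting group counts toward the ratio.
    """
    entries_by_text = defaultdict(list)
    for sample_id, (text, label) in enumerate(zip(texts, labels)):
        entries_by_text[text].append((sample_id, label))

    # Keep only texts that were observed with more than one distinct label.
    conflicts = {
        text: entries
        for text, entries in entries_by_text.items()
        if len({label for _, label in entries}) > 1
    }
    n_conflicting = sum(len(entries) for entries in conflicts.values())
    ratio = n_conflicting / len(texts) if texts else 0.0
    return ratio, conflicts


# Toy data (hypothetical, not taken from the dataset in this report).
texts = ["good", "bad", "good", "fine", "good"]
labels = [1, 0, 0, 1, 1]
ratio, conflicts = conflicting_label_ratio(texts, labels)
print(f"{ratio:.2%} of samples have conflicting labels")
for text, entries in conflicts.items():
    print(text, entries)
```

With a condition threshold of 0%, a single conflicting group is enough to make the condition fail, which is consistent with the 0.21% reported for the train dataset.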


Text Duplicates - Train Dataset

Checks for duplicate samples in the dataset.

Conditions Summary
Condition | More Info
Duplicate data ratio is less or equal to 5% | Found 2.58% duplicate data
Additional Outputs
2.58% of data samples are duplicates.
Each row in the table shows an example of a text duplicate and the number of times it appears.
Text | Sample IDs | Number of Samples
كلنا قيس السعيد | 578, 1468, 1755 | 3
قيسون ولد الشعب امعاه ربي والش... | 1107, 1960 | 2
فاسد يريد ان يكون رئيس دوله | 1005, 1227 | 2
يحيا قيسون | 1058, 1417 | 2
الزقفونه بواس لكتاف انفضح | 712, 2129 | 2
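
A duplicate table of this shape can be approximated with the standard library alone. The sketch below is an illustration, not the deepchecks implementation: it assumes exact string matching and assumes the ratio counts every repeat beyond the first occurrence of a text. The function name and toy data are hypothetical.

```python
from collections import defaultdict


def duplicate_report(texts):
    """Return (ratio, duplicates), where duplicates maps each repeated text to its sample IDs.

    Assumption: the ratio counts each occurrence after the first one.
    """
    ids_by_text = defaultdict(list)
    for sample_id, text in enumerate(texts):
        ids_by_text[text].append(sample_id)

    duplicates = {text: ids for text, ids in ids_by_text.items() if len(ids) > 1}
    n_repeats = sum(len(ids) - 1 for ids in duplicates.values())
    ratio = n_repeats / len(texts) if texts else 0.0
    return ratio, duplicates


# Toy data (hypothetical, not taken from the dataset in this report).
ratio, duplicates = duplicate_report(["a", "b", "a", "c", "a", "b"])
print(f"{ratio:.2%} of data samples are duplicates")
for text, ids in duplicates.items():
    print(text, ids, len(ids))
```

The condition in the report then reduces to comparing this ratio against the 5% threshold.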


Text Duplicates - Test Dataset

Checks for duplicate samples in the dataset.

Conditions Summary
Condition | More Info
Duplicate data ratio is less or equal to 5% | Found 1.36% duplicate data
Additional Outputs
1.36% of data samples are duplicates.
Each row in the table shows an example of a text duplicate and the number of times it appears.
Text | Sample IDs | Number of Samples
قيس سعيد | 132, 441 | 2
يحيا قيس سعيد | 260, 453 | 2
المرزوقي يشرف كل تونسي حر ويكف... | 285, 312 | 2
كلنا قيس سعيد | 163, 238 | 2
الناس الكل عندها الحق تتكلم ال... | 437, 439 | 2


Special Characters - Train Dataset

Find samples that contain special characters, along with the most common special characters in the dataset.

Conditions Summary
Condition | More Info
Ratio of samples containing more than 20% special characters is below 5% | Found 1 samples with special char ratio above threshold
Additional Outputs
1.5% of samples contain special characters
List of ignored special characters: ['*', ',', '<', '+', '=', '^', '>', '_', '\\', ' ', ';', '[', '(', '-', '/', '`', '}', '%', '~', ')', '|', '.', '"', '#', '$', ':', ']', '?', '{', '&', '!', '@', "'"]
Sample ID | % of Special Characters | Special Characters | Text
346 | 0.29 | ['َ', 'ِ', 'ْ', 'ُ', 'ّ'] | الشَعْب يُرِيد اِسْقَاط وَ عَزْل المُسمَي وَ المَدْعُو قَيْس سَعِيد المُجْرِم الدِكْتَاتُور الاِنقلَ
1672 | 0.06 | ['ُ', 'ٌ'] | تحيا تونس ورئيسها السيٌدقيُس سعيُد رئيس الجمهوريه
2161 | 0.05 | ['َ'] | لم يغادوَروا الاخوان
380 | 0.04 | ['ّ'] | عيّن الياس الفخفاخ و فشل عيّن هشام المشيشي و فشلعيّن بعد الانقلاب نجلاء بودن و فشلتعيّن احمد الحشاني
902 | 0.03 | ['ْ'] | كلامه نضري لا يمت للواقع بشيْ
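
The special characters flagged in this table are Arabic diacritics, which are Unicode combining marks rather than letters. The sketch below shows one way such a ratio could be computed. It is an assumption-laden illustration, not the deepchecks implementation: it treats any character that is neither a Unicode letter nor a digit, and is not in the ignore list, as special. The ignore list here is built from `string.punctuation` plus the space character, which contains the same 33 characters as the list printed in the report. All function names are hypothetical.

```python
import string
import unicodedata

# Same characters as the ignore list shown in the report: ASCII punctuation plus space.
IGNORED = set(string.punctuation) | {" "}


def special_characters(text, ignored=IGNORED):
    """Return characters that are neither letters/digits nor in the ignore list."""
    return [
        ch
        for ch in text
        if ch not in ignored and unicodedata.category(ch)[0] not in ("L", "N")
    ]


def special_char_ratio(text):
    """Fraction of characters in the text that count as special."""
    return len(special_characters(text)) / len(text) if text else 0.0


def special_char_condition(texts, max_char_ratio=0.2, max_sample_ratio=0.05):
    """True when the share of samples above max_char_ratio stays below max_sample_ratio."""
    if not texts:
        return True
    n_over = sum(special_char_ratio(t) > max_char_ratio for t in texts)
    return n_over / len(texts) < max_sample_ratio


# Three Arabic letters plus a shadda (U+0651), the combining mark listed for sample 380.
word = "\u0639\u064a\u0651\u0646"
print(special_characters(word) == ["\u0651"], special_char_ratio(word))
```

Under these assumptions, only a sample whose ratio exceeds 0.2 counts against the condition. In the train table that is sample 346 at 0.29, which matches the single sample reported above the threshold.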


Special Characters - Test Dataset

Find samples that contain special characters, along with the most common special characters in the dataset.

Conditions Summary
Condition | More Info
Ratio of samples containing more than 20% special characters is below 5% | Found 0 samples with special char ratio above threshold
Additional Outputs
0.58% of samples contain special characters
List of ignored special characters: ['*', ',', '<', '+', '=', '^', '>', '_', '\\', ' ', ';', '[', '(', '-', '/', '`', '}', '%', '~', ')', '|', '.', '"', '#', '$', ':', ']', '?', '{', '&', '!', '@', "'"]
Sample ID | % of Special Characters | Special Characters | Text
431 | 0.02 | ['ٌ'] | انت بعيد صديقي ولا تعلم انه تسبٌب في ازمه تونس بتصرٌفاته الحمقاء لقد كان في فتره رئاسته مجرٌد طرطور
399 | 0.02 | ['ّ'] | عجيبه تفسيرات قيس سعيّد لنتائج الانتخابات
302 | 0.01 | ['ّ'] | النظام العلماني الشرس المدعوم من الاستعمار الفرنسي والامريكي لا زال يتحكّم في النظام التونسي المعادي
19 | 0.00 | [] | تونس اصبحت مهزله
0 | 0.00 | [] | رايته عند المومياء المحنط متع الحيمايا تعيس هذا الكف الاول من عند الشعب صفر منتخب والكف الثاني عند ر


Check Without Conditions Output


Other Checks That Weren't Displayed

Check | Reason
Text Property Outliers - Test Dataset | Functionality requires properties, but the TextData object had none. To use this functionality, use the set_properties method to set your own properties with a pandas.DataFrame or use TextData.calculate_builtin_properties to add the default deepchecks properties.
Under Annotated Property Segments - Train Dataset | Functionality requires properties, but the TextData object had none. To use this functionality, use the set_properties method to set your own properties with a pandas.DataFrame or use TextData.calculate_builtin_properties to add the default deepchecks properties.
Text Property Outliers - Train Dataset | Functionality requires properties, but the TextData object had none. To use this functionality, use the set_properties method to set your own properties with a pandas.DataFrame or use TextData.calculate_builtin_properties to add the default deepchecks properties.
Property Label Correlation - Test Dataset | Functionality requires properties, but the TextData object had none. To use this functionality, use the set_properties method to set your own properties with a pandas.DataFrame or use TextData.calculate_builtin_properties to add the default deepchecks properties.
Property Label Correlation - Train Dataset | Functionality requires properties, but the TextData object had none. To use this functionality, use the set_properties method to set your own properties with a pandas.DataFrame or use TextData.calculate_builtin_properties to add the default deepchecks properties.
Under Annotated Meta Data Segments - Test Dataset | Functionality requires metadata, but the TextData object had none. To use this functionality, use the set_metadata method to set your own metadata with a pandas.DataFrame.
Under Annotated Meta Data Segments - Train Dataset | Functionality requires metadata, but the TextData object had none. To use this functionality, use the set_metadata method to set your own metadata with a pandas.DataFrame.
Under Annotated Property Segments - Test Dataset | Functionality requires properties, but the TextData object had none. To use this functionality, use the set_properties method to set your own properties with a pandas.DataFrame or use TextData.calculate_builtin_properties to add the default deepchecks properties.
Frequent Substrings - Test Dataset | Nothing found
Unknown Tokens - Test Dataset | Nothing found
Conflicting Labels - Test Dataset | Nothing found
Frequent Substrings - Train Dataset | Nothing found
Unknown Tokens - Train Dataset | Nothing found
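
The skipped checks above name their own remedy: attach properties and metadata to the TextData object before running the suite. A minimal sketch follows, under several assumptions: the task is text classification (the labels shown in this report are 0 and 1), `deepchecks` and `pandas` are installed, and the metadata is supplied as one dictionary per sample. The helper name `build_text_data` and its parameters are hypothetical. `calculate_builtin_properties` and `set_metadata` are the methods named in the messages above.

```python
def build_text_data(texts, labels, metadata_rows=None):
    """Build a TextData object carrying built-in properties and optional metadata."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    if metadata_rows is not None and len(metadata_rows) != len(texts):
        raise ValueError("metadata_rows must contain one row per text sample")

    # Imported lazily so the validation above works even without the libraries installed.
    import pandas as pd
    from deepchecks.nlp import TextData

    data = TextData(
        raw_text=list(texts),
        label=list(labels),
        task_type="text_classification",  # assumption: binary classification labels
    )
    # Adds the default deepchecks properties required by the property-based checks.
    data.calculate_builtin_properties()
    if metadata_rows is not None:
        # Metadata is required by Under Annotated Meta Data Segments.
        data.set_metadata(pd.DataFrame(metadata_rows))
    return data
```

Re-running the suite on train and test objects built this way should let the property-based and metadata-based checks produce output instead of appearing in this table. Computing the built-in properties may take noticeably longer on large datasets.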
